Within the word count constraints, I would like to give quick but sincere thanks to lecturers Daqing Chen and George Bamfo for their guidance and supervision, especially during tutorials, where they provided regular and bespoke feedback on the following Big Data Analytics project.
The following report outlines analysis of a dataset collated by LSBU and provided by the borough of Havering, covering car accident occurrences along with other relevant variables recorded at the time of each incident. Examples of the attributes provided for each accident include weather conditions, road conditions, police attendance and severity.
The dataset, provided by the Department for Transport, outlines accident information in the borough of Havering. The data comes as two Excel files: one containing the main data, where most columns are coded with numbers, and another mapping each code to its label. Due to recent changes in the economy, councils such as Havering have been facing a budget crunch. For example, an article released by (Romford Recorder, 2023) on 5th August 2023 outlined how Havering council rejected a petition proposed by citizens requesting a lollipop lady outside two schools. By conducting analysis on the data provided by the Department for Transport, we would be able to recommend safer alternatives to Havering council, in turn allowing them to significantly increase citizen road safety.
An authority monitoring report published by (Havering Council, 2021-2022) outlined that 57% of students in Havering walk to school. Though this is part of the council's green initiative, the significantly high number of pupils walking to school poses a great risk to pedestrian safety. This business question is raised in section 1.2.
1) How is the severity of accidents impacted by weather conditions, road surface conditions and the light conditions?
2) How is pedestrian safety / the number of casualties affected by different road types, for example the type of junction (Junction_Detail), the type of pedestrian crossing (Pedestrian_Crossing-Physical_Facilities) and Road_Type?
3) How is the severity of accidents on a particular road class impacted by the speed limit in rural vs urban areas?
To gain a better idea of the geographical location of Havering, a plotted map of accidents is shown below (interactive).
import folium
from folium.plugins import HeatMap
# Create a base map
m = folium.Map(location=[filtered_df['Latitude'].mean(), filtered_df['Longitude'].mean()], zoom_start=10)
# Create a HeatMap layer to visualize accident density
heat_data = [[row['Latitude'], row['Longitude']] for index, row in filtered_df.iterrows()]
HeatMap(heat_data).add_to(m)
# Add markers for individual accidents (optional)
for index, row in filtered_df.iterrows():
    folium.Marker([row['Latitude'], row['Longitude']],
                  icon=folium.Icon(color='red', icon='info-sign')).add_to(m)
# Save the map as an HTML file
m.save('accidents_map.html')
# Display the map in Jupyter Notebook (optional)
m
Importing libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.cluster import KMeans
from sklearn.preprocessing import OneHotEncoder
from sklearn.decomposition import PCA
import scipy.stats as stats
from sklearn.preprocessing import MinMaxScaler, StandardScaler
Defining width/length of outputs
pd.set_option('display.max_columns', None)
pd.set_option('display.width', 1000)
pd.set_option('display.max_colwidth', 50)
pd.set_option('display.max_rows', 50)
Importing initial dataset
df = pd.read_csv('AccidentLondonBoroughs2223.csv')
Filtering by Havering Burough
filtered_df = df[df['Local_Authority_Highway'] == 'E09000016']
An overview of the dataset:
filtered_df.head()
The code below outputs the number of rows and columns (attributes) for the dataset provided by the Havering borough. There are 4203 rows and 32 columns/attributes.
num_of_instances = filtered_df.shape[0]
print("Number of rows =", num_of_instances)
num_of_attributes = filtered_df.shape[1]
print("Number of columns/attributes =", num_of_attributes)
The snippet below outlines the current data types. It can be observed that the coded columns are still numeric and have not yet been converted to categorical; this is expected, considering the two Excel sheets provided (main and mapping).
data_types = filtered_df.dtypes
print(data_types)
The below outputs in-depth statistics of the initial dataset (before changing the data types). These include the count, minimum, 25%, 50% (median), 75%, maximum, range, average, standard deviation, kurtosis, skewness and mode.
# Calculate mode for all columns (both numeric and categorical)
mode_df = filtered_df.mode().iloc[0]
# Filter out only numeric columns for other statistics
numeric_filtered_df = filtered_df.select_dtypes(include='number')
# Calculate min, max, and range for numeric columns
min_df = numeric_filtered_df.min()
max_df = numeric_filtered_df.max()
range_df = max_df - min_df
# Calculate other statistics for numeric columns
average_df = numeric_filtered_df.mean()
median_df = numeric_filtered_df.median()
std_dev_df = numeric_filtered_df.std()
kurtosis_df = numeric_filtered_df.kurtosis()
skewness_df = numeric_filtered_df.skew()
# Calculate the quantiles
quantiles_df = numeric_filtered_df.quantile([0.25, 0.5, 0.75])
# Creating a summary DataFrame
summary_df = pd.DataFrame({
'Count': numeric_filtered_df.count(),
'Minimum': min_df,
'25%': quantiles_df.loc[0.25],
'50% (Median)': quantiles_df.loc[0.5], # same as 'Median': median_df
'75%': quantiles_df.loc[0.75],
'Maximum': max_df,
'Range': range_df,
'Average': average_df,
'Standard Deviation': std_dev_df,
'Kurtosis': kurtosis_df,
'Skewness': skewness_df,
'Mode': mode_df
})
# Transpose the summary DataFrame to get statistics on the x axis
summary_df_transposed = summary_df.T
# Display the transposed summary DataFrame
display(summary_df_transposed)
The following checks all cells for any null or -1 values and outputs the count of missing values. It can be seen that the columns Junction_Control, 2nd_Road_Class and LSOA_of_Accident_Location contain missing values. It is worth noting that this is slightly misrepresentative and will be explored further later.
filtered_df_copy = filtered_df.copy()
# Create a mask for -1 values
minus_one_mask = (filtered_df_copy == -1)
# Create a mask for NaN values
nan_mask = filtered_df_copy.isnull()
# Combine the masks
combined_mask = minus_one_mask | nan_mask
# Count the number of True values in each column
error_counts = combined_mask.sum()
# Filter to display counts where the number of errors is more than 0
error_counts_more_than_zero = error_counts[error_counts > 0]
print(error_counts_more_than_zero)
The code below imports the Road-Accident-Safety-Data-Guide.xls Excel file, which contains the code-to-label mappings for the categorical columns provided in the initial file (AccidentLondonBoroughs2223.csv). It then maps the coded values to their labels and converts the data type to categorical, applying this to each column by taking as parameters the column name and its respective sheet (in Road-Accident-Safety-Data-Guide.xls) containing the codes and labels.
data_guide_path = 'Road-Accident-Safety-Data-Guide.xls'
data_guide = pd.read_excel(data_guide_path, sheet_name=None)
# Function to apply the mapping from the data guide to the dataframe
def apply_mapping(df, column, sheet_name):
    # Choose the correct code column ('code' or 'Code')
    code_column = 'code' if 'code' in data_guide[sheet_name].columns else 'Code'
    # Choose the correct label column ('label' or 'Label')
    label_column = 'label' if 'label' in data_guide[sheet_name].columns else 'Label'
    # Create a mapping dictionary from the 'code'/'Code' to the 'label'/'Label'
    mapping_dict = pd.Series(data_guide[sheet_name][label_column].values, index=data_guide[sheet_name][code_column].astype(str)).to_dict()
    # Apply the mapping to the dataframe column
    df[column] = df[column].astype(str).map(mapping_dict).astype('category')
# Apply the mapping for each specified column
apply_mapping(filtered_df, 'Police_Force', 'Police Force')
apply_mapping(filtered_df, 'Accident_Severity', 'Accident Severity')
apply_mapping(filtered_df, 'Day_of_Week', 'Day of Week')
apply_mapping(filtered_df, 'Local_Authority_District', 'Local Authority (District)')
apply_mapping(filtered_df, 'Local_Authority_Highway', 'Local Authority (Highway)')
apply_mapping(filtered_df, '1st_Road_Class', '1st Road Class')
apply_mapping(filtered_df, 'Road_Type', 'Road Type')
apply_mapping(filtered_df, 'Junction_Detail', 'Junction Detail')
apply_mapping(filtered_df, 'Junction_Control', 'Junction Control')
apply_mapping(filtered_df, '2nd_Road_Class', '2nd Road Class')
apply_mapping(filtered_df, 'Pedestrian_Crossing-Human_Control', 'Ped Cross - Human')
apply_mapping(filtered_df, 'Pedestrian_Crossing-Physical_Facilities', 'Ped Cross - Physical')
apply_mapping(filtered_df, 'Light_Conditions', 'Light Conditions')
apply_mapping(filtered_df, 'Weather_Conditions', 'Weather')
apply_mapping(filtered_df, 'Road_Surface_Conditions', 'Road Surface')
apply_mapping(filtered_df, 'Special_Conditions_at_Site', 'Special Conditions at Site')
apply_mapping(filtered_df, 'Carriageway_Hazards', 'Carriageway Hazards')
apply_mapping(filtered_df, 'Urban_or_Rural_Area', 'Urban Rural')
apply_mapping(filtered_df, 'Did_Police_Officer_Attend_Scene_of_Accident', 'Police Officer Attend')
It can also be seen below that the relevant data types have been converted from numerical to categorical.
data_types = filtered_df.dtypes
print(data_types)
The below outputs the new data frame after having mapped categorical values.
filtered_df.head()
After further examining the second Excel sheet, it became apparent that empty values were not only represented as null or -1 but also as "Unknown", "Data missing or out of range", etc. (examples from the Excel sheet are shown below). The code below loops through all columns and outputs the number of missing values. This new check (after converting the data types) reveals many more missing values which may need to be addressed.
missing_values_indicators = [
"Data missing or out of range",
"Unclassified",
"Unknown",
"Unallocated"
]
# Initialize a dictionary to hold the count of missing values for each column
missing_or_null_value_counts = {}
# Iterate over each column in the DataFrame
for column in filtered_df.columns:
    # Count the occurrences of each indicator string and NaN values in the column
    # and sum them to get the total count of missing or null values for that column
    missing_count = filtered_df[column].isnull().sum() \
        + filtered_df[column].astype(str).isin(missing_values_indicators).sum()
    # Add to the dictionary only if the count is greater than 1
    if missing_count > 1:
        missing_or_null_value_counts[column] = missing_count
# Filter the results to only include columns with more than one missing value
missing_or_null_value_counts_filtered = {k: v for k, v in missing_or_null_value_counts.items() if v > 1}
# Convert the counts to a DataFrame for easy viewing, if not empty
if missing_or_null_value_counts_filtered:
    missing_counts_df = pd.DataFrame.from_dict(missing_or_null_value_counts_filtered, orient='index', columns=['Missing/Null Value Count'])
else:
    missing_counts_df = pd.DataFrame(columns=['Missing/Null Value Count'])
# Print or save the DataFrame as needed
print(missing_counts_df)
The code below checks for any duplicate rows based on the ID Accident_Index. There are none.
duplicated = filtered_df['Accident_Index'].duplicated().sum()
# If duplicated is greater than 0, there are duplicate values
print(f"Number of duplicate entries in 'Accident_Index': {duplicated}")
Based on the output below, there do not seem to be outliers for the speed limit; though the frequency of 20mph is low, this does not necessarily mean it is an outlier, it is instead likely just uncommon but valid data. The number of casualties does have outliers; it could be said that any row where the number of casualties is > 6 is an outlier. This will be addressed later in data preparation.
numeric_cols = ['Speed_limit', 'Number_of_Casualties']
for col in numeric_cols:
    plt.figure(figsize=(6, 3))
    plt.hist(filtered_df[col], bins=40, edgecolor='black')
    plt.title(f'Histogram of {col}')
    plt.xlabel(col)
    plt.ylabel('Frequency')
    plt.show()
Due to the data being categorical, it makes sense to identify outliers using counts. Based on the output below, some outliers are very clear: for example, the occurrence of accidents involving 6 or more vehicles is extremely rare. Another example is that it is very uncommon for there to be more than 6 casualties in an accident.
# List of variables to calculate frequency
variables_to_count = [
'Police_Force', 'Accident_Severity', 'Number_of_Vehicles', 'Number_of_Casualties',
'Day_of_Week', '1st_Road_Class',
'Road_Type', 'Speed_limit', 'Junction_Detail', 'Junction_Control',
'2nd_Road_Class', 'Pedestrian_Crossing-Human_Control',
'Pedestrian_Crossing-Physical_Facilities', 'Light_Conditions', 'Weather_Conditions',
'Road_Surface_Conditions', 'Special_Conditions_at_Site', 'Carriageway_Hazards',
'Urban_or_Rural_Area', 'Did_Police_Officer_Attend_Scene_of_Accident'
]
# Loop through each variable and print its frequency without the datatype or index name
for variable in variables_to_count:
    frequency = filtered_df[variable].value_counts()
    frequency.index.name = None  # Set the index name to None to avoid printing it
    print(f"Frequency for {variable}:")
    print(frequency.to_string())  # This will print without the index name
    print("\n")  # Adding a new line for better readability
The code below graphs all categorical columns against the number of casualties as a means of identifying categorical outliers.
# Columns to exclude from the bar charts
exclude_columns = [
'Number_of_Casualties', 'Accident_Index', 'Location_Easting_OSGR',
'Location_Northing_OSGR', 'Longitude', 'Latitude', 'Date', 'Time', 'Police_Force', 'Local_Authority_District', 'Local_Authority_Highway', 'LSOA_of_Accident_Location'
]
for column in filtered_df.columns:
    if column not in exclude_columns:
        group_data = filtered_df.groupby(column)['Number_of_Casualties'].sum().sort_values()
        # Create a bar chart
        group_data.plot(kind='bar', figsize=(3, 1))
        # Set the title and labels
        plt.title(f'Number of Casualties by {column}')
        plt.xlabel(column)
        plt.ylabel('Number_of_Casualties')
        # Show the plot
        plt.show()
The following code outputs the percentage of each class in a column. This is useful in determining the business questions, as it is not recommended to use columns which have more than 20% missing data.
# Columns to exclude
exclude_columns = ['Accident_Index', 'LSOA_of_Accident_Location', 'Time', 'Date']
# Loop through categorical columns except the excluded ones and calculate class percentages
for column in filtered_df.select_dtypes(include=['object', 'category']).columns:
    if column not in exclude_columns:
        # Get the frequency of each class
        class_distribution = filtered_df[column].value_counts()
        # Calculate the percentage of each class
        class_percentage = (class_distribution / len(filtered_df)) * 100
        print(f"Class percentage for {column}:\n{class_percentage}\n")
        # Optionally, you can set a threshold to flag the imbalance, like the 80/20 rule.
        if class_percentage.max() > 80:
            print(f"{column} is imbalanced.\n")
# Define ANSI escape codes for red color and bold text
RED = '\033[91m'
BOLD = '\033[1m'
RESET = '\033[0m'
# Columns to exclude
exclude_columns = ['Accident_Index', 'LSOA_of_Accident_Location', 'Time', 'Date']
# Loop through categorical columns except the excluded ones and calculate class percentages
for column in filtered_df.select_dtypes(include=['object', 'category']).columns:
    if column not in exclude_columns:
        # Get the frequency of each class
        class_distribution = filtered_df[column].value_counts()
        # Calculate the percentage of each class
        class_percentage = (class_distribution / len(filtered_df)) * 100
        # Threshold to flag the imbalance, 80/20 rule.
        if class_percentage.max() > 80:
            print(f"{RED}{BOLD}Warning: {column} is highly imbalanced.{RESET}\n")
for column in filtered_df.columns:
    if pd.api.types.is_numeric_dtype(filtered_df[column]):
        # It's a numeric column, so we can calculate min and max
        min_value = filtered_df[column].min()
        max_value = filtered_df[column].max()
        print(f"{column} range: {min_value} to {max_value}")
    else:
        # It's not numeric, so we count unique values instead
        unique_values = filtered_df[column].nunique()
        print(f"{column} has {unique_values} unique values")
# Function to detect outliers based on z-score for numeric columns only
def detect_outliers_z_score(data, column):
    if pd.api.types.is_numeric_dtype(data[column]):
        mean_val = data[column].mean()
        std_val = data[column].std()
        threshold = 3
        # Find outliers
        outliers = data[abs(data[column] - mean_val) > threshold * std_val][column]
        return outliers
    else:
        # Return an empty series for non-numeric columns
        return pd.Series(dtype=float)
# Apply the function to each numeric column and store results
outliers_dict = {}
for column in filtered_df:
    if pd.api.types.is_numeric_dtype(filtered_df[column]):
        outliers_dict[column] = detect_outliers_z_score(filtered_df, column)
# Create a DataFrame from the dictionary of outliers
outliers_df = pd.DataFrame.from_dict(outliers_dict, orient='index').transpose()
print(outliers_df)
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.preprocessing import OneHotEncoder
# Categorical columns used for the clustering below
categorical_columns = [
'Weather_Conditions', 'Road_Surface_Conditions', 'Light_Conditions',
'1st_Road_Class', 'Urban_or_Rural_Area', 'Junction_Control',
'Pedestrian_Crossing-Physical_Facilities', 'Road_Type', 'Accident_Severity'
]
# One-hot encode these categorical columns
encoder = OneHotEncoder()
encoded_categorical = encoder.fit_transform(filtered_df[categorical_columns]).toarray()
# Apply PCA to reduce dimensions for visualization
pca = PCA(n_components=2)
reduced_data = pca.fit_transform(encoded_categorical)
# Apply k-means clustering
kmeans = KMeans(n_clusters=5) # Adjust the number of clusters as needed
kmeans.fit(reduced_data)
# Create a scatter plot of the reduced data with the cluster labels
plt.scatter(reduced_data[:, 0], reduced_data[:, 1], c=kmeans.labels_, cmap='viridis')
plt.xlabel('PCA 1')
plt.ylabel('PCA 2')
plt.title('Clusters from k-means')
# Create a color key (legend) for the clusters
unique_labels = set(kmeans.labels_)
for label in unique_labels:
    plt.scatter([], [], color=plt.cm.viridis(label / max(kmeans.labels_)), label=f'Cluster {label + 1}')
plt.legend()
plt.show()
THE DATA TYPES WERE CONVERTED TO CATEGORICAL IN DATA UNDERSTANDING, AS PER THE RECOMMENDATION OF DAQING CHEN
CONVERTING DATATYPES (REPEAT)
The code below imports the Road-Accident-Safety-Data-Guide.xls Excel file, which contains the code-to-label mappings for the categorical columns provided in the initial file (AccidentLondonBoroughs2223.csv). It then maps the coded values to their labels and converts the data type to categorical, applying this to each column by taking as parameters the column name and its respective sheet (in Road-Accident-Safety-Data-Guide.xls) containing the codes and labels.
data_guide_path = 'Road-Accident-Safety-Data-Guide.xls'
data_guide = pd.read_excel(data_guide_path, sheet_name=None)
# Function to apply the mapping from the data guide to the dataframe
def apply_mapping(df, column, sheet_name):
    # Choose the correct code column ('code' or 'Code')
    code_column = 'code' if 'code' in data_guide[sheet_name].columns else 'Code'
    # Choose the correct label column ('label' or 'Label')
    label_column = 'label' if 'label' in data_guide[sheet_name].columns else 'Label'
    # Create a mapping dictionary from the 'code'/'Code' to the 'label'/'Label'
    mapping_dict = pd.Series(data_guide[sheet_name][label_column].values, index=data_guide[sheet_name][code_column].astype(str)).to_dict()
    # Apply the mapping to the dataframe column
    df[column] = df[column].astype(str).map(mapping_dict).astype('category')
# Apply the mapping for each specified column
apply_mapping(filtered_df, 'Police_Force', 'Police Force')
apply_mapping(filtered_df, 'Accident_Severity', 'Accident Severity')
apply_mapping(filtered_df, 'Day_of_Week', 'Day of Week')
apply_mapping(filtered_df, 'Local_Authority_District', 'Local Authority (District)')
apply_mapping(filtered_df, 'Local_Authority_Highway', 'Local Authority (Highway)')
apply_mapping(filtered_df, '1st_Road_Class', '1st Road Class')
apply_mapping(filtered_df, 'Road_Type', 'Road Type')
apply_mapping(filtered_df, 'Junction_Detail', 'Junction Detail')
apply_mapping(filtered_df, 'Junction_Control', 'Junction Control')
apply_mapping(filtered_df, '2nd_Road_Class', '2nd Road Class')
apply_mapping(filtered_df, 'Pedestrian_Crossing-Human_Control', 'Ped Cross - Human')
apply_mapping(filtered_df, 'Pedestrian_Crossing-Physical_Facilities', 'Ped Cross - Physical')
apply_mapping(filtered_df, 'Light_Conditions', 'Light Conditions')
apply_mapping(filtered_df, 'Weather_Conditions', 'Weather')
apply_mapping(filtered_df, 'Road_Surface_Conditions', 'Road Surface')
apply_mapping(filtered_df, 'Special_Conditions_at_Site', 'Special Conditions at Site')
apply_mapping(filtered_df, 'Carriageway_Hazards', 'Carriageway Hazards')
apply_mapping(filtered_df, 'Urban_or_Rural_Area', 'Urban Rural')
apply_mapping(filtered_df, 'Did_Police_Officer_Attend_Scene_of_Accident', 'Police Officer Attend')
The code below drops all columns not relevant to the business questions stated in section 1.
columns_to_keep = [
'Weather_Conditions', 'Road_Surface_Conditions', 'Light_Conditions',
'Speed_limit', 'Urban_or_Rural_Area', 'Junction_Detail',
'Pedestrian_Crossing-Physical_Facilities', 'Road_Type', 'Accident_Severity',
'Number_of_Casualties'
]
new_df = filtered_df[columns_to_keep].copy()
It can be observed that two of the columns required for the business problem contain missing values. Through checking the class percentages, it was determined that no more than 20% of the values in each column were missing, meaning it is safe to use said columns.
missing_values_indicators = [
"Data missing or out of range",
"Unclassified",
"Unknown",
"Unallocated"
]
# Initialize a dictionary to hold the count of missing values for each column
missing_or_null_value_counts = {}
# Iterate over each column in the DataFrame
for column in new_df.columns:
    # Count the occurrences of each indicator string and NaN values in the column
    # and sum them to get the total count of missing or null values for that column
    missing_count = new_df[column].isnull().sum() \
        + new_df[column].astype(str).isin(missing_values_indicators).sum()
    # Add to the dictionary only if the count is greater than 1
    if missing_count > 1:
        missing_or_null_value_counts[column] = missing_count
# Filter the results to only include columns with more than one missing value
missing_or_null_value_counts_filtered = {k: v for k, v in missing_or_null_value_counts.items() if v > 1}
# Convert the counts to a DataFrame for easy viewing, if not empty
if missing_or_null_value_counts_filtered:
    missing_counts_df = pd.DataFrame.from_dict(missing_or_null_value_counts_filtered, orient='index', columns=['Missing/Null Value Count'])
else:
    missing_counts_df = pd.DataFrame(columns=['Missing/Null Value Count'])
# Print or save the DataFrame as needed
print(missing_counts_df)
# missing_counts_df.to_csv('missing_or_null_value_counts.csv')
The code below outputs the frequency of the two columns containing missing values, allowing us to impute them appropriately. It was considered to replace missing values based on the class percentages, so as to maintain the same proportions. However, after further consideration, this method would have much the same effect as simply ignoring the missing values, so it was decided to just use the mode.
# List of variables to calculate frequency
variables_to_count = [
'Weather_Conditions', 'Road_Type'
]
# Loop through each variable and print its frequency with value descriptions
for variable in variables_to_count:
    frequency = new_df[variable].value_counts().reset_index()
    frequency.columns = ['Value', 'Frequency']  # Rename columns for clarity
    print(f"Frequency for {variable}:")
    for row in frequency.itertuples(index=False):
        print(f"{row.Value} = {row.Frequency}")
    print("\n")  # Adding a new line for better readability
new_df['Weather_Conditions'] = new_df['Weather_Conditions'].replace("Unknown", "Fine no high winds")
new_df['Road_Type'] = new_df['Road_Type'].replace("Unknown", "Single carriageway")
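The hard-coded replacement labels correspond to the modes identified above. As a sketch (run here on hypothetical toy data, not the project dataset), the same imputation can be derived programmatically, which avoids hard-coding and excludes the "Unknown" placeholder when computing the mode:

```python
import pandas as pd

# Replace a placeholder label with the mode of the column,
# where the mode is computed after excluding the placeholder itself
def impute_unknown_with_mode(df, columns, placeholder="Unknown"):
    for col in columns:
        mode_value = df.loc[df[col] != placeholder, col].mode().iloc[0]
        df[col] = df[col].replace(placeholder, mode_value)
    return df

# Toy example (hypothetical values for illustration)
toy = pd.DataFrame({
    "Weather_Conditions": ["Fine no high winds", "Unknown",
                           "Fine no high winds", "Raining no high winds"]
})
toy = impute_unknown_with_mode(toy, ["Weather_Conditions"])
# The "Unknown" entry is replaced by the most frequent label
```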
RECHECKING IF MISSING VALUES FILLED
After filling in the missing values, it can be seen that there are no more missing values and that the commands above have executed as expected
missing_values_indicators = [
"Data missing or out of range",
"Unclassified",
"Unknown",
"Unallocated"
]
# Initialize a dictionary to hold the count of missing values for each column
missing_or_null_value_counts = {}
# Iterate over each column in the DataFrame
for column in new_df.columns:
    # Count the occurrences of each indicator string and NaN values in the column
    # and sum them to get the total count of missing or null values for that column
    missing_count = new_df[column].isnull().sum() \
        + new_df[column].astype(str).isin(missing_values_indicators).sum()
    # Add to the dictionary only if the count is greater than 1
    if missing_count > 1:
        missing_or_null_value_counts[column] = missing_count
# Filter the results to only include columns with more than one missing value
missing_or_null_value_counts_filtered = {k: v for k, v in missing_or_null_value_counts.items() if v > 1}
# Convert the counts to a DataFrame for easy viewing, if not empty
if missing_or_null_value_counts_filtered:
    missing_counts_df = pd.DataFrame.from_dict(missing_or_null_value_counts_filtered, orient='index', columns=['Missing/Null Value Count'])
else:
    missing_counts_df = pd.DataFrame(columns=['Missing/Null Value Count'])
# Print or save the DataFrame as needed
print(missing_counts_df)
# missing_counts_df.to_csv('missing_or_null_value_counts.csv')
In order to find outliers for the numerical columns, histograms of their distributions are plotted below.
numeric_cols = ['Speed_limit', 'Number_of_Casualties']
for col in numeric_cols:
    plt.figure(figsize=(5, 2))
    plt.hist(new_df[col], bins=40, edgecolor='black')
    plt.title(f'Histogram of {col}')
    plt.xlabel(col)
    plt.ylabel('Frequency')
    plt.show()
The code below outputs the count of each class for easier readability.
# List of variables to calculate frequency
variables_to_count = [
'Number_of_Casualties'
]
# Loop through each variable and print its frequency with value descriptions
for variable in variables_to_count:
    frequency = new_df[variable].value_counts().reset_index()
    frequency.columns = ['Value', 'Frequency']  # Rename columns for clarity
    print(f"Frequency for {variable}:")
    for row in frequency.itertuples(index=False):
        print(f"{row.Value} = {row.Frequency}")
    print("\n")  # Adding a new line for better readability
The rows where the number of casualties is more than 8 have been dropped. Though other columns have classes where the frequency was low, e.g. snow, this does not necessarily mean they are outliers; it could simply be an uncommon occurrence. Another example of this is the 20mph speed limit: it is rare, but not an outlier.
# Keep rows where 'Number_of_Casualties' is less than 8
new_df = new_df[new_df['Number_of_Casualties'] < 8]
After running the count loop again, I was able to confirm that the rows were dropped as expected.
# List of variables to calculate frequency
variables_to_count = [
'Number_of_Casualties'
]
# Loop through each variable and print its frequency with value descriptions
for variable in variables_to_count:
    frequency = new_df[variable].value_counts().reset_index()
    frequency.columns = ['Value', 'Frequency']  # Rename columns for clarity
    print(f"Frequency for {variable}:")
    for row in frequency.itertuples(index=False):
        print(f"{row.Value} = {row.Frequency}")
    print("\n")  # Adding a new line for better readability
# Calculate Z-scores of each column in the DataFrame
z_scores = stats.zscore(new_df.select_dtypes(include=[np.number]))
# Convert Z-scores to a DataFrame
z_scores_df = pd.DataFrame(z_scores, columns=new_df.select_dtypes(include=[np.number]).columns)
# Define a threshold for identifying outliers
threshold = 3
# Find where Z-scores are above the threshold
outliers = (z_scores_df.abs() > threshold).any(axis=1)
# Print the rows in the DataFrame that contain outliers
outliers_df = new_df[outliers]
display(outliers_df)
I then dropped any rows (see below) that fit the criteria above.
new_df = new_df[~outliers]
ENSURING ROWS HAVE DROPPED
It can be seen below that this was done successfully.
# Calculate Z-scores of each column in the DataFrame
z_scores = stats.zscore(new_df.select_dtypes(include=[np.number]))
# Convert Z-scores to a DataFrame
z_scores_df = pd.DataFrame(z_scores, columns=new_df.select_dtypes(include=[np.number]).columns)
# Define a threshold for identifying outliers
threshold = 3
# Find where Z-scores are above the threshold
outliers = (z_scores_df.abs() > threshold).any(axis=1)
# Print the rows in the DataFrame that contain outliers
outliers_df = new_df[outliers]
display(outliers_df)
The code below outputs any imbalanced classes along with their percentages; the threshold is set to 80%. The second code snippet provides a list of the imbalanced columns in a human-readable format.
# Loop through categorical columns and calculate class percentages
for column in new_df.select_dtypes(include=['object', 'category']).columns:
    # Get the frequency of each class
    class_distribution = new_df[column].value_counts()
    # Calculate the percentage of each class
    class_percentage = (class_distribution / len(new_df)) * 100
    # Print the class percentages
    print(f"Class percentage for {column}:\n{class_percentage}\n")
    # Optionally, set a threshold to flag the imbalance
    if class_percentage.max() > 80:
        print(f"Warning: {column} is imbalanced. The majority class '{class_distribution.idxmax()}' accounts for {class_percentage.max()}% of the data.\n\n\n")
Though I was able to identify imbalanced classes, due to limitations of knowledge I was unable to successfully balance these columns without significantly manipulating the data. SMOTE was considered; however, it is generally only used for numerical classes/columns. As a workaround, the chosen model will take into account weighted sampling, so as to bypass the problem of imbalanced classes.
# Define ANSI escape codes for red color and bold text
RED = '\033[91m'
BOLD = '\033[1m'
RESET = '\033[0m'
# Columns to exclude
exclude_columns = ['Accident_Index', 'LSOA_of_Accident_Location', 'Time', 'Date']
# Loop through categorical columns except the excluded ones and calculate class percentages
for column in new_df.select_dtypes(include=['object', 'category']).columns:
    if column not in exclude_columns:
        # Get the frequency of each class
        class_distribution = new_df[column].value_counts()
        # Calculate the percentage of each class
        class_percentage = (class_distribution / len(new_df)) * 100
        # Threshold to flag the imbalance, 80/20 rule.
        if class_percentage.max() > 80:
            print(f"{RED}{BOLD}Warning: {column} is highly imbalanced.{RESET}\n")
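The weighted-sampling workaround described above can be sketched with scikit-learn's class_weight machinery (a minimal illustration on toy labels; the 90/10 split and the class names here are assumptions for demonstration, not values from the project dataset):

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Toy imbalanced labels: 90 'Slight' vs 10 'Serious' accidents
y = np.array(["Slight"] * 90 + ["Serious"] * 10)

# 'balanced' weights each class by n_samples / (n_classes * class_count),
# so the minority class is up-weighted during model fitting
classes = np.unique(y)  # sorted: ['Serious', 'Slight']
weights = compute_class_weight(class_weight="balanced", classes=classes, y=y)
weight_map = dict(zip(classes, weights))
# Serious: 100 / (2 * 10) = 5.0; Slight: 100 / (2 * 90) ≈ 0.556
```

Passing class_weight='balanced' to a scikit-learn classifier applies exactly this re-weighting internally, which is why it sidesteps the need to resample the imbalanced columns.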
One-hot encoding is the process where categorical variables are converted into a form that can be input into a machine learning algorithm (hackernoon.com, n.d.). It typically involves expanding columns and converting them into a binary form. The code below transforms the needed categorical columns into binary features and stores them as encoded_df. One-hot encoding is typically only applied to categorical types.
import pandas as pd
from sklearn.preprocessing import OneHotEncoder
# Define the columns to be one-hot encoded
categorical_cols = ['Weather_Conditions', 'Road_Surface_Conditions', 'Light_Conditions',
                    'Urban_or_Rural_Area', 'Junction_Detail', 'Pedestrian_Crossing-Physical_Facilities',
                    'Road_Type']
# Perform one-hot encoding
onehotencoder = OneHotEncoder(sparse_output=False)  # 'sparse' was renamed to 'sparse_output' in scikit-learn 1.2
encoded_features = onehotencoder.fit_transform(new_df[categorical_cols])
encoded_feature_names = onehotencoder.get_feature_names_out(categorical_cols)
encoded_df = pd.DataFrame(encoded_features, columns=encoded_feature_names)
# Reset index to align the original and encoded DataFrames
new_df.reset_index(drop=True, inplace=True)
# Concatenate the one-hot encoded columns with the original DataFrame
new_df_encoded = pd.concat([new_df.drop(categorical_cols, axis=1), encoded_df], axis=1)
The code below normalises the numerical column Speed_limit to the range [0, 1].
from sklearn.preprocessing import MinMaxScaler
# Normalize numerical columns
scaler = MinMaxScaler()
new_df_encoded['Speed_limit'] = scaler.fit_transform(new_df_encoded[['Speed_limit']])
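MinMaxScaler rescales each value with the formula x' = (x − min) / (max − min). The standalone sketch below (assumed toy speed limits, not the Havering column) verifies that against a manual computation:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy speed limits in mph (assumed values for illustration)
speeds = np.array([[20.0], [30.0], [40.0], [70.0]])

scaled = MinMaxScaler().fit_transform(speeds)

# Manual check of the same formula: x' = (x - min) / (max - min)
manual = (speeds - speeds.min()) / (speeds.max() - speeds.min())
print(np.allclose(scaled, manual))  # True
```

The minimum (20) maps to 0 and the maximum (70) maps to 1, so the speed limit ends up on the same scale as the one-hot encoded binary features.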
The dataset is then split into three feature/target tables, one for each of the three business questions identified in Section 1.
from sklearn.model_selection import train_test_split
# For Question 1: Impact of conditions on accident severity
features_q1 = new_df_encoded.filter(regex='Weather_|Road_|Light_|Speed_limit')
target_q1 = new_df_encoded['Accident_Severity']
# For Question 2: Impact on pedestrian safety and number of casualties
features_q2 = new_df_encoded.filter(regex='Road_Type|Junction_|Pedestrian_')
target_q2 = new_df_encoded['Number_of_Casualties']
# For Question 3: Severity of accidents in different road classes and impact of speed limits
features_q3 = new_df_encoded.filter(regex='Road_Type|Speed_limit|Urban_or_Rural_')
target_q3 = new_df_encoded['Accident_Severity']
# Split the data into training and testing sets for each business question
X_train_q1, X_test_q1, y_train_q1, y_test_q1 = train_test_split(features_q1, target_q1, test_size=0.2, random_state=42)
X_train_q2, X_test_q2, y_train_q2, y_test_q2 = train_test_split(features_q2, target_q2, test_size=0.2, random_state=42)
X_train_q3, X_test_q3, y_train_q3, y_test_q3 = train_test_split(features_q3, target_q3, test_size=0.2, random_state=42)
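As a sanity check on this splitting step, the sketch below (synthetic data, not the Havering tables) confirms that test_size=0.2 yields an 80/20 row split and that fixing random_state makes the split reproducible across runs:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for one feature table: 100 rows, 3 features
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = rng.integers(0, 2, size=100)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
print(len(X_train), len(X_test))  # 80 20

# The same random_state reproduces exactly the same split
X_train2, X_test2, _, _ = train_test_split(X, y, test_size=0.2, random_state=42)
print(np.array_equal(X_train, X_train2))  # True
```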
MODELLING - Q1
The following code utilises a Random Forest classifier from scikit-learn. It focuses on predicting 'Accident_Severity' based on features related to weather conditions, road surface conditions, and light conditions. The code preprocesses the data, splits it into training and testing sets, and employs a Random Forest Classifier with balanced class weights to account for imbalanced classes. It then outputs the feature importances to assess variable significance. The code also evaluates the model's performance using a classification report.
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
import pandas as pd
# Assuming 'new_df_encoded' is your preprocessed DataFrame ready for modeling
# and 'target_q1' is your target variable for question 1
# Define the features for Question 1
features_q1 = new_df_encoded.filter(regex='Weather_Conditions|Road_Surface_Conditions|Light_Conditions')
# Define the target variable for Question 1
target_q1 = new_df_encoded['Accident_Severity']
# Split the dataset into training and testing sets
X_train_q1, X_test_q1, y_train_q1, y_test_q1 = train_test_split(
    features_q1,
    target_q1,
    test_size=0.2,
    random_state=42
)
# Initialize the Random Forest Classifier with class weights
rf_classifier_q1 = RandomForestClassifier(class_weight='balanced', random_state=42)
# Train the classifier
rf_classifier_q1.fit(X_train_q1, y_train_q1)
# Get the feature importances
feature_importances = rf_classifier_q1.feature_importances_
# Create a DataFrame from the feature importances
feature_importances_df = pd.DataFrame({
    'Feature': features_q1.columns,
    'Importance': feature_importances
}).sort_values(by='Importance', ascending=False).reset_index(drop=True)
# Display the feature importances DataFrame
display(feature_importances_df)
# Predict on the test set
y_pred_q1 = rf_classifier_q1.predict(X_test_q1)
# Evaluate the predictions
classification_report_q1 = classification_report(y_test_q1, y_pred_q1)
print(classification_report_q1)
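One caveat on the importances reported above: impurity-based feature_importances_ can overstate features with many categories. Permutation importance on held-out data is a common cross-check. A minimal sketch on synthetic data (not the accident dataset) shows the approach:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data: only the first feature actually drives the label
rng = np.random.default_rng(42)
X = rng.normal(size=(400, 3))
y = (X[:, 0] > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=42)
rf = RandomForestClassifier(random_state=42).fit(X_tr, y_tr)

# Permutation importance: drop in test score when each column is shuffled
result = permutation_importance(rf, X_te, y_te, n_repeats=10, random_state=42)
print(result.importances_mean)  # the first feature should dominate
```

Because the score is measured on the held-out set, a feature only scores highly if it genuinely helps generalisation, which makes this a useful complement to the impurity-based ranking used in the Q1 model.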
MODELLING - Q2
The code below also uses a Random Forest to predict Number_of_Casualties. It uses features related to Road_Type, Junction_Detail and Pedestrian_Crossing-Physical_Facilities from the preprocessed DataFrame new_df_encoded.
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
import pandas as pd
# Assuming 'new_df_encoded' is your preprocessed DataFrame ready for modeling
# and 'target_q2' is your target variable for question 2
# Define the features for Question 2
features_q2 = new_df_encoded.filter(regex='Road_Type|Junction_Detail|Pedestrian_Crossing-Physical_Facilities')
# Define the target variable for Question 2
target_q2 = new_df_encoded['Number_of_Casualties']
# Split the dataset into training and testing sets
X_train_q2, X_test_q2, y_train_q2, y_test_q2 = train_test_split(
    features_q2,
    target_q2,
    test_size=0.2,
    random_state=42
)
# Initialize the Random Forest Classifier with class weights
rf_classifier_q2 = RandomForestClassifier(
    n_estimators=200,
    max_depth=20,
    min_samples_split=5,
    class_weight='balanced_subsample',
    random_state=42
)
# Train the classifier
rf_classifier_q2.fit(X_train_q2, y_train_q2)
# Get the feature importances
feature_importances_q2 = rf_classifier_q2.feature_importances_
# Create a DataFrame from the feature importances
feature_importances_df_q2 = pd.DataFrame({
    'Feature': features_q2.columns,
    'Importance': feature_importances_q2
}).sort_values(by='Importance', ascending=False).reset_index(drop=True)
# Display the feature importances DataFrame
display(feature_importances_df_q2)
# Predict on the test set
y_pred_q2 = rf_classifier_q2.predict(X_test_q2)
# Evaluate the predictions
classification_report_q2 = classification_report(y_test_q2, y_pred_q2)
print(classification_report_q2)
MODELLING - Q3
The final model uses a Logistic Regression model from scikit-learn. It focuses on predicting Accident_Severity using features related to Road_Type, Speed_limit and Urban_or_Rural_Area from the preprocessed DataFrame new_df_encoded.
A regression model can also be represented as a network (Daqing C., 2023).
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
import pandas as pd
# Assuming 'new_df_encoded' is your preprocessed DataFrame ready for modeling
# and 'target_q3' is your target variable for question 3
# Define the features for Question 3
features_q3 = new_df_encoded.filter(regex='Road_Type|Speed_limit|Urban_or_Rural')
# Define the target variable for Question 3
target_q3 = new_df_encoded['Accident_Severity']
# Split the dataset into training and testing sets
X_train_q3, X_test_q3, y_train_q3, y_test_q3 = train_test_split(
    features_q3,
    target_q3,
    test_size=0.2,
    random_state=42
)
# Initialize the Logistic Regression model with class weights and regularization
logreg_q3 = LogisticRegression(
    class_weight='balanced',
    max_iter=5000,
    C=0.5,  # Regularization strength (smaller C = stronger regularization)
    random_state=42
)
# Train the model
logreg_q3.fit(X_train_q3, y_train_q3)
# Get the model's coefficients
# Note: with three severity classes, coef_ has one row per class;
# [0] inspects the coefficients for the first class only
feature_coefficients = logreg_q3.coef_[0]
# Create a DataFrame for the coefficients
coefficients_df = pd.DataFrame({
    'Feature': features_q3.columns,
    'Coefficient': feature_coefficients
}).sort_values(by='Coefficient', ascending=False).reset_index(drop=True)
# Display the coefficients DataFrame
display(coefficients_df)
# Predict on the test set
y_pred_q3 = logreg_q3.predict(X_test_q3)
# Evaluate the predictions
classification_report_q3 = classification_report(y_test_q3, y_pred_q3)
print(classification_report_q3)
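Raw logistic-regression coefficients are on the log-odds scale; exponentiating a coefficient gives an odds ratio, which is often easier to communicate to stakeholders such as the council. A small self-contained sketch with simulated data (the feature effects here are assumed for illustration, not taken from the Havering data):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy binary problem: the first feature strongly raises the odds of class 1,
# the second slightly lowers them (assumed true effects: +2.0 and -0.5)
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))
logits = 2.0 * X[:, 0] - 0.5 * X[:, 1]
y = (rng.random(500) < 1 / (1 + np.exp(-logits))).astype(int)

model = LogisticRegression(max_iter=1000).fit(X, y)

# exp(coefficient) = multiplicative change in the odds per one-unit increase
odds_ratios = np.exp(model.coef_[0])
print(odds_ratios)  # first value > 1 (risk-increasing) and larger than the second
```

An odds ratio above 1 marks a risk-increasing feature and below 1 a protective one, which is how the Q3 coefficient table can be read.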
print("Business Question 1")
print(classification_report_q1)
print("\n\n\n\n")
print("Business Question 2")
print(classification_report_q2)
print("\n\n\n\n")
print("Business Question 3")
print(classification_report_q3)
Model Used: Random Forest Classifier
Insights: The feature importances from the Random Forest model suggest that certain weather conditions, like "Fine + high winds," and light conditions, such as "Darkness - no lights," are significant predictors of accident severity. This indicates that extreme weather and poor lighting may contribute to more severe accidents.
The model shows a significant imbalance with a tendency to correctly predict 'Slight' severity accidents, while 'Fatal' and 'Serious' severities have very low precision and recall.
Model Used: Random Forest Classifier (with tuned hyperparameters: n_estimators=200, max_depth=20, min_samples_split=5)
Insights: The model has likely identified specific road types and pedestrian crossing facilities that correlate with an increased number of casualties. For example, certain types of junctions or crossings without adequate safety measures might see higher casualty numbers.
The model performs well for predicting single-casualty incidents but performs poorly for incidents with more casualties.
Model Used: Logistic Regression
Insights: The coefficients from the Logistic Regression model can be interpreted to understand the impact of different road classes and speed limits on the severity of accidents. For instance, higher speed limits in urban areas might be associated with a certain level of accident severity.
The model has low accuracy and is heavily biased towards predicting the majority class 'Slight'. It demonstrates a high recall for 'Fatal' and 'Serious' but with very low precision, indicating a high number of false positives for these classes.
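The accuracy and precision trade-offs noted above are typical of imbalanced targets: plain accuracy rewards always predicting the majority class, while imbalance-aware metrics such as balanced accuracy and macro-averaged F1 expose the problem. A toy illustration (labels invented for the example, not the real severity counts):

```python
import numpy as np
from sklearn.metrics import accuracy_score, balanced_accuracy_score, f1_score

# 90 'Slight' vs 10 'Fatal' accidents, and a model that always predicts 'Slight'
y_true = np.array(['Slight'] * 90 + ['Fatal'] * 10)
y_pred = np.array(['Slight'] * 100)

acc = accuracy_score(y_true, y_pred)                    # 0.9 — deceptively good
bal = balanced_accuracy_score(y_true, y_pred)           # 0.5 — exposes the bias
macro = f1_score(y_true, y_pred, average='macro', zero_division=0)
print(acc, bal, macro)
```

Balanced accuracy averages per-class recall, so a classifier that never predicts 'Fatal' scores no better than chance, mirroring the weaknesses reported for the Q1 and Q3 models.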
[1] Romford Recorder. (2023). Petition for lollipop people outside two Havering schools rejected. [online] Available at: https://www.romfordrecorder.co.uk/news/23690203.havering-council-rejects-petition-school-lollipop-people/ [Accessed 4th Nov. 2023].
[2] hackernoon.com. (n.d.). What is One Hot Encoding? Why And When do you have to use it? | Hacker Noon. [online] Available at: https://hackernoon.com/what-is-one-hot-encoding-why-and-when-do-you-have-to-use-it-e3c6186d008f.
[3] Chen, D. (2023). 09 Regression [PowerPoint slides]. Computer Science 4637_2122, London South Bank University, delivered week 9.
[4] Chris, K. (2022). Database Normalization – Normal Forms 1nf 2nf 3nf Table Examples. [online] freeCodeCamp.org. Available at: https://www.freecodecamp.org/news/database-normalization-1nf-2nf-3nf-table-examples/.
[5] Scikit-learn.org. (2014). sklearn.linear_model.LogisticRegression — scikit-learn 0.21.2 documentation. [online] Available at: https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html.
[6] Real Python (2019). Linear Regression in Python. [online] Realpython.com. Available at: https://realpython.com/linear-regression-in-python/.
[7] funnel.io. (n.d.). What is data mapping? Data mapping explained (with examples). [online] Available at: https://funnel.io/blog/your-guide-to-data-mapping [Accessed 1st Dec. 2023].
[8] GeeksforGeeks. (2020). Box Plot in Python using Matplotlib. [online] Available at: https://www.geeksforgeeks.org/box-plot-in-python-using-matplotlib/.
[9] www.w3schools.com. (n.d.). Python Machine Learning - K-means. [online] Available at: https://www.w3schools.com/python/python_ml_k-means.asp.